
160 research outputs found

Performance of random forest when SNPs are in linkage disequilibrium

Author: A Bureau
C Strobl
DF Schwarz
DJ Schaid
EM Reiman
JH Friedman
K Nicodemus
Kathryn L Lunetta
KJ Archer
KL Lunetta
L Adrienne Cupples
L Breiman
L Breiman
L Breiman
L Breiman
Lindsay A Farrer
N Risch
R Díaz-Uriarte
S Purcell
Y Freund
Y Meng
Yan A Meng
Yi Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build an RF. Results: We evaluated the performance of our alternative methods by simulation of a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype. Conclusion: Our results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.

Crossref

Boston University Institutional Repository (OpenBU)

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

SNPInterForest: A new method for detecting epistatic interactions

Author: A Bureau
Asako Koike
C Yang
CW Gini
DR Velez
J Hoh
J Marchini
KL Lunetta
L Breiman
M Ritchie
Makiko Yoshida
MI McCarthy
R Culverhouse
R Jiang
TK Rice
Wellcome Trust Case Control Consortium
X Wan
Y Zhang
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract. Background: Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e., epistatic interactions, however, remains a significant challenge in large-scale association studies. Results: We have developed a new method, named SNPInterForest, for identifying epistatic interactions by extending an ensemble learning technique called random forest. Random forest is a predictive method that has been proposed for use in discovering the single-nucleotide polymorphisms (SNPs) that are most predictive of the disease status in association studies. However, it is less sensitive to SNPs with little marginal effect. Furthermore, it does not natively exhibit information on interaction patterns of susceptibility SNPs. We extended the random forest framework to overcome the above limitations by means of (i) modifying the construction of the random forest and (ii) implementing a procedure for extracting interaction patterns from the constructed random forest. The performance of the proposed method was evaluated by simulated data under a wide spectrum of disease models. SNPInterForest performed very well in successfully identifying pure epistatic interactions with high precision and was still more than capable of concurrently identifying multiple interactions under the existence of genetic heterogeneity. It was also applied to real GWAS data of rheumatoid arthritis from the Wellcome Trust Case Control Consortium (WTCCC), and novel potential interactions were reported. Conclusions: SNPInterForest, offering an efficient means to detect epistatic interactions without statistical analyses, is promising for practical use as a way to reveal the epistatic interactions involved in common complex diseases.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Use of principal components to aggregate rare variants in case-control and family-based association studies in the presence of multiple covariates

Author: AL Price
B Li
BE Madsen
C Dering
D Rabinowitz
DJ Schaid
F Han
H Zou
John S Witte
KL Lunetta
LA Almasy
NM Laird
R Tibshirani
Rémi Kazma
S Morgenthaler
S Nejentsev
Thomas J Hoffmann
TJ Hoffmann
W Bodmer
Publication venue: BioMed Central
Publication date: 01/01/2011

Rare variants may help to explain some of the missing heritability of complex diseases. Technological advances in next-generation sequencing give us the opportunity to test this hypothesis. We propose two new methods (one for case-control studies and one for family-based studies) that combine aggregated rare variants and common variants located within a region through principal components analysis and allow for covariate adjustment. We analyzed 200 replicates consisting of 209 case subjects and 488 control subjects and compared the results to weight-based and step-up aggregation methods. The principal components and collapsing method showed an association between the gene FLT1 and the quantitative trait Q1 (P < 10^−30) in a fraction of the computation time of the other methods. The proposed family-based test has inconclusive results. The two methods provide a fast way to analyze simultaneously rare and common variants at the gene level while adjusting for covariates. However, further evaluation of the statistical efficiency of this approach is warranted.

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

A Two-Stage Random Forest-Based Pathway Analysis Method

Author: A Bureau
A Torkamani
DF Schwarz
DJ Hunter
EE Calle
H Eleftherohorinou
H Lind
H Pang
HJ Cordell
JS Chang
K Wang
K Wang
KL Lunetta
L Breiman
L De Lobel
LS Chen
M Kanehisa
MD Mailman
N Chatterjee
P Holmans
P Scheet
Ren-Hua Chung
S Purcell
SG Park
SG Park
TL Edwards
X Zhang
Xi-Nian Zuo
YA Meng
Ying-Erh Chen
Publication venue: Public Library of Science
Publication date: 01/01/2012

Pathway analysis provides a powerful approach for identifying the joint effect of genes grouped into biologically-based pathways on disease. Pathway analysis is also an attractive approach for a secondary analysis of genome-wide association study (GWAS) data that may still yield new results from these valuable datasets. Most current pathway analysis methods focus on testing the cumulative main effects of genes in a pathway. However, for complex diseases, gene-gene interactions are expected to play a critical role in disease etiology. We extended a random forest-based method for pathway analysis by incorporating a two-stage design. We used simulations to verify that the proposed method has the correct type I error rates. We also used simulations to show that, in the presence of gene-gene interactions, the method is more powerful than both the original random forest-based pathway approach and the set-based test implemented in PLINK. Finally, we applied the method to a breast cancer GWAS dataset and a lung cancer GWAS dataset, and interesting pathways were identified that have implications for breast and lung cancers.

Crossref

National Health Research Institutes

Directory of Open Access Journals

PubMed Central

A polymorphic variant of the insulin-like growth factor 1 (IGF-1) receptor correlates with male longevity in the Italian population: a genetic study and evaluation of circulating IGF-1 from the "Treviso Longeva (TRELONG)" study

Author: A Antebi
A Herskind
A Ruiz-Torres
AI Yashin
Andrea Zanardo
Angelica Vittori
AR Cameron
AV Samuelson
C Selman
CT Murphy
D Lio
D McCulloch
D van Heemst
Diego Albani
G Passarino
Gianluigi Forloni
Giovanni Battista Gajo
J Cheng
J Garcia
J Pinkston-Gosse
J vB Hjelmborg
JM Harper
K Hall
KL Lunetta
LA Herndon
Letizia Polito
M Barbieri
M Bonafè
M Gallucci
Marzia Pesaresi
Maurizio Gallucci
N Shved
S Rodriguez
Sara Batelli
Sergio De Angeli
SK Kim
T Cederholm
T Kojima
T Münzer
Y Suh
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: An attenuation of the insulin-like growth factor 1 (IGF-1) signaling has been associated with elongation of the lifespan in simple metazoan organisms and in rodents. In humans, IGF-1 level has an age-related modulation with a lower concentration in the elderly, depending on hormonal and genetic factors affecting the IGF-1 receptor gene (IGF-1R). Methods: In an elderly population from North-eastern Italy (n = 668 subjects, age range 70–106 years) we investigated the IGF-1R polymorphism G3174A (rs2229765) and the plasma concentration of free IGF-1. Frequency distributions were compared using the χ² goodness-of-fit test, and means were compared by one-way analysis of variance (ANOVA); multiple regression analysis was performed using JMP7 for SAS software (SAS Institute, USA). The limit of significance for genetic and biochemical comparison was set at α = 0.05. Results: Males showed an age-related increase in the A-allele of rs2229765 and a change in the plasma level of IGF-1, which dropped significantly after 85 years of age (85+ group). In the male 85+ group, A/A homozygous subjects had the lowest plasma IGF-1 level. We found no clear correlation between rs2229765 genotype and IGF-1 in the females. Conclusion: These findings confirm the importance of the rs2229765 minor allele as a genetic predisposing factor for longevity in Italy, where a sex-specific pattern for IGF-1 attenuation with ageing was found.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Conditional variable importance for random forests

Author: A Bureau
Achim Zeileis
Anne-Laure Boulesteix
BJ van Os
C Strobl
C Strobl
C Strobl
Carolin Strobl
E Bauer
JH Silber
K Nicodemus
KJ Archer
KL Lunetta
L Breiman
L Breiman
L Breiman
L Breiman
L Breiman
M Nason
MR Segal
Mvan der Laan
P Bühlmann
P Good
R Development Core Team
R Diaz-Uriarte
R Diaz-Uriarte
R Feraud
SM Stigler
T Hastie
T Hothorn
TG Dietterich
Thomas Augustin
Thomas Kneib
V Svetnik
W Rodenburg
X Huang
X Xia
Y Lin
Y Qi
Publication venue: BioMed Central
Publication date: 01/01/2008

Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach.
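The difference between the unconditional and conditional permutation schemes named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the stand-in `predict` model are all assumptions made for illustration, and conditioning is simplified to shuffling within groups that share one other predictor's value rather than within any partition derived from fitted trees.

```python
import random
from collections import defaultdict


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, col, cond_col=None,
                           repeats=20, seed=0):
    """Mean drop in accuracy after permuting column `col`.

    With cond_col=None the column is shuffled across all rows (the
    unconditional scheme). Otherwise it is shuffled only within groups
    of rows sharing the same value in `cond_col` (a simplified
    conditional scheme, assumed here for illustration).
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)

    # Group row indices by the conditioning value (one group if none).
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = row[cond_col] if cond_col is not None else None
        groups[key].append(i)

    total_drop = 0.0
    for _ in range(repeats):
        permuted = [list(row) for row in rows]
        for idx in groups.values():
            values = [rows[i][col] for i in idx]
            rng.shuffle(values)
            for i, v in zip(idx, values):
                permuted[i][col] = v
        total_drop += baseline - accuracy(predict, permuted, labels)
    return total_drop / repeats


# Hypothetical toy data: x0 determines the outcome; x1 is a noisy copy of x0.
rows = [[i % 2, (i % 2) if i % 10 else 1 - (i % 2)] for i in range(200)]
labels = [row[0] for row in rows]


def predict(row):
    # Hypothetical stand-in for a fitted model that split on the proxy x1.
    return row[1]


unconditional = permutation_importance(predict, rows, labels, col=1)
conditional = permutation_importance(predict, rows, labels, col=1, cond_col=0)
print(f"unconditional: {unconditional:.3f}, conditional: {conditional:.3f}")
```

In this toy setting the proxy `x1` receives a clearly positive unconditional importance because shuffling it across all rows breaks its correlation with `x0`, whereas shuffling it only within levels of `x0` leaves the accuracy unchanged, since `x1` carries no information beyond `x0`.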

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Open Access LMU

Elektronische Publikationen der Wirtschaftsuniversität Wien

Exceptionally low likelihood of Alzheimer's dementia in APOE2 homozygotes from a 5,000-person neuropathological study

Each additional copy of the apolipoprotein E4 (APOE4) allele is associated with a higher risk of Alzheimer's dementia, while the APOE2 allele is associated with a lower risk of Alzheimer's dementia. It is not yet known whether APOE2 homozygotes have a particularly low risk. We generated Alzheimer's dementia odds ratios and other findings in more than 5,000 clinically characterized and neuropathologically characterized Alzheimer's dementia cases and controls. APOE2/2 was associated with low Alzheimer's dementia odds ratios compared to APOE2/3 and 3/3, and an exceptionally low odds ratio compared to APOE4/4, and the impact of APOE2 and APOE4 gene dose was significantly greater in the neuropathologically confirmed group than in more than 24,000 neuropathologically unconfirmed cases and controls. Finding and targeting the factors by which APOE and its variants influence Alzheimer's disease could have a major impact on the understanding, treatment and prevention of the disease.

UCL Discovery

A Functional Polymorphism in Renalase (Glu37Asp) Is Associated with Cardiac Hypertrophy, Dysfunction, and Ischemia: Data from the Heart and Soul Study

Author: AJ Coats
AS Go
AS Levey
AS Tan
B Ruo
BA Vakili
Beeya Na
BF Culleton
CK Morris
D Levy
G Li
Gary V. Desir
GC Fonarow
GV Desir
GV Desir
Harald H. H. W. Schmidt
J Abrams
J Abrams
J Neumann
J Xu
J Xu
JA Joles
JK Oh
JS Silberberg
KL Lunetta
Mary A. Whooley
MP Turakhia
MV Berridge
NB Schiller
NB Schiller
Nelson B. Schiller
NS Anavekar
Q Zhao
RA Wolfe
Ramin Farzaneh-Far
RN Foley
SL Kopecky
SS Ghosh
V Didelez
WJ Remme
WS Peart
X Ren
Publication venue: Public Library of Science
Publication date: 01/10/2010

Renalase is a soluble enzyme that metabolizes circulating catecholamines. A common missense polymorphism in the flavin-adenine dinucleotide-binding domain of human renalase (Glu37Asp) has recently been described. The association of this polymorphism with cardiac structure, function, and ischemia has not previously been reported. We genotyped the rs2296545 single-nucleotide polymorphism (Glu37Asp) in 590 Caucasian individuals and performed resting and stress echocardiography. Logistic regression was used to examine the associations of the Glu37Asp polymorphism (C allele) with cardiac hypertrophy (LV mass > 100 g/m²), systolic dysfunction (LVEF < 50%), diastolic dysfunction, poor treadmill exercise capacity (METS < 5) and inducible ischemia. Compared with the 406 participants who had GG or CG genotypes, the 184 participants with the CC genotype had increased odds of left ventricular hypertrophy (OR = 1.43; 95% CI 0.99-2.06), systolic dysfunction (OR = 1.72; 95% CI 1.01-2.94), diastolic dysfunction (OR = 1.75; 95% CI 1.05-2.93), poor exercise capacity (OR = 1.61; 95% CI 1.05-2.47), and inducible ischemia (OR = 1.49; 95% CI 0.99-2.24). The Glu37Asp (CC genotype) caused a 24-fold decrease in affinity for NADH and a 2.3-fold reduction in maximal renalase enzymatic activity. A functional missense polymorphism in renalase (Glu37Asp) is associated with cardiac hypertrophy, ventricular dysfunction, poor exercise capacity, and inducible ischemia in persons with stable coronary artery disease. Further studies investigating the therapeutic implications of this polymorphism should be considered.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Bias in random forest variable importance measures: Illustrations, sources and a solution

Author: A Bureau
A Dobra
A Liaw
Achim Zeileis
AG Heidema
AL Boulesteix
AL Boulesteix
Anne-Laure Boulesteix
C Furlanello
C Strobl
C Strobl
C Strobl
Carolin Strobl
DN Politis
EC Gunther
H Kim
I Kononenko
J Friedman
J Friedman
K Arun
KL Lunetta
L Breiman
L Breiman
L Breiman
M van der Laan
MM Ward
MP Cummings
MP Cummings
MR Segal
P Bühlmann
PJ Bickel
R Development Core Team
R Díaz-Uriarte
R Guha
T Hothorn
T Hothorn
TM Therneau
Torsten Hothorn
V Svetnik
X Huang
Y Qi
Y Shih
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
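The first mechanism named in this abstract, biased variable selection when predictors differ in their number of categories, can be illustrated with a small simulation. The following Python sketch is an assumption-laden toy, not the paper's simulation design: it uses Gini impurity with an exhaustive search over binary splits as the selection criterion, and the function names and parameters (`best_gini_gain`, `mean_null_gain`, sample sizes) are hypothetical.

```python
import random
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of categorical x."""
    cats = sorted(set(x))
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Enumerate the subsets of categories sent to the left child.
    for size in range(1, len(cats)):
        for left_cats in combinations(cats, size):
            left = [yi for xi, yi in zip(x, y) if xi in left_cats]
            right = [yi for xi, yi in zip(x, y) if xi not in left_cats]
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            best = max(best, parent - child)
    return best


def mean_null_gain(n_categories, n_samples=100, n_sims=200, seed=0):
    """Average best split gain for a predictor unrelated to the outcome."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        y = [rng.randrange(2) for _ in range(n_samples)]
        x = [rng.randrange(n_categories) for _ in range(n_samples)]
        total += best_gini_gain(x, y)
    return total / n_sims


gain_2 = mean_null_gain(2)  # uninformative binary predictor
gain_6 = mean_null_gain(6)  # uninformative predictor with six categories
print(f"mean best gain, 2 categories: {gain_2:.4f}")
print(f"mean best gain, 6 categories: {gain_6:.4f}")
```

Both predictors are pure noise, yet the one with more categories offers more candidate splits and therefore tends to reach a larger best-split gain by chance, which is how a split criterion of this kind can artificially prefer it.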

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Open Access LMU

Elektronische Publikationen der Wirtschaftsuniversität Wien